【feature】Commit Message: Optimized PyMuPDFScraper to handle invalid o… #1012

Merged 1 commit on Dec 14, 2024

Conversation


@MC-shark MC-shark commented Dec 9, 2024

1. Enhanced error handling in PyMuPDFScraper so that URLs with invalid links or protective mechanisms (e.g., rate limiting, CAPTCHA) no longer cause the scraper to hang indefinitely.
2. Added exception handling that raises an error when such conditions are encountered, so the system stays stable and responsive.
3. Added logging that captures detailed error information for troubleshooting and monitoring.
4. Tested and confirmed that the changes work as expected, preventing system crashes on these URLs.
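
For readers who want a concrete picture of points 1 to 3, here is a minimal sketch of the general pattern (bounded timeout, explicit exception, logged error). It is not the code from this PR: `fetch_pdf_bytes`, `PDFDownloadError`, the `opener` parameter, and the 5-second default are hypothetical names and values chosen for illustration, and it uses the standard-library `urllib` rather than whatever HTTP client the scraper actually uses.

```python
import logging
import urllib.request

logger = logging.getLogger("pdf_scraper_sketch")  # hypothetical logger name


class PDFDownloadError(Exception):
    """Hypothetical error raised when a PDF cannot be fetched."""


def fetch_pdf_bytes(url, timeout=5.0, opener=urllib.request.urlopen):
    """Download a PDF, failing fast instead of hanging.

    `opener` is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    try:
        # The timeout bounds each blocking socket operation (connect, read),
        # so an unresponsive server raises instead of blocking forever.
        with opener(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError, and socket timeouts;
        # ValueError covers malformed URLs.
        logger.error("Failed to download %s: %s", url, exc)
        raise PDFDownloadError(f"could not download {url}") from exc

    # A CAPTCHA or rate-limit page is typically HTML, not a PDF.
    # PDF files begin with the bytes "%PDF".
    if not data.startswith(b"%PDF"):
        logger.error("Response from %s is not a PDF", url)
        raise PDFDownloadError(f"{url} did not return a PDF")
    return data
```

A caller would wrap `fetch_pdf_bytes(link)` in `try/except PDFDownloadError` and skip that source. Note that the timeout applies per socket operation, so a server that trickles data slowly can still take longer than `timeout` seconds in total.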

…r defense-mechanism-protected URLs more efficiently, preventing long delays and system crashes.
@MC-shark
Author

MC-shark commented Dec 9, 2024

For example, https://www.tesla.com/ns_videos/Tesla-Master-Plan-Part-3.pdf caused the scraper to hang indefinitely.

Owner

@assafelovic assafelovic left a comment


This is great, thank you! What improvements do you see in the report results now?

@assafelovic assafelovic merged commit 99d65b0 into assafelovic:master Dec 14, 2024